Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds transfers between stores to external attachments #1358

Open
wants to merge 50 commits into
base: main
Choose a base branch
from

Conversation

Spoffy
Copy link
Contributor

@Spoffy Spoffy commented Jan 4, 2025

Context

This PR follows on from #1320, and adds support for:

  • Moving attachments between stores
  • Basic configuration of external attachments using environment variables.

This allows a user to change the document's store, then transfer all of the attachments from their current store(s) to the new default.

This includes transfers from internal (SQLite) storage to external storage, external to internal, and external to external (e.g MinIO to filesystem).

This PR also introduces the concept of "labels", which are an admin-friendly way to refer to a store, and map 1-to-1 with sttore IDs. Labels don't need to be unique between instances, only within an instance.

User-facing changes:

  • Adds API endpoints to :
    • Migrate all attachments from their current store to the store set for that document
    • Check on the status of transfers
    • Get and set the store for a document
  • Adds an environment variable for setting external attachments behaviour GRIST_EXTERNAL_ATTACHMENTS_MODE
    • The only valid value is test currently, which sets Grist to use a temporary folder in the filesystem.

Internal changes:

  • Adds methods to AttachmentFileManager to facilitate transfers from one storage to another.
  • Exposes new methods on ActiveDoc for triggering transfers, retrieving transfer status and setting the store.
  • Refactors how DocStorage provides access to attachment details
  • Adds a way to retrieve attachment config from env vars, and use them to decide which stores will be available.

All of the logic behind these changes should be documented in the source code with comments - if anything is unclear, that would be great feedback.

There a few large block comments (particularly in AttachmentFileManager) that might be a good place to start when reading this code.

Related issues

#1320
#1021

Has this been tested?

  • 👍 yes, I added tests to the test suite
  • 💭 no, because this PR is a draft and still needs work (some tests missing)

@Spoffy Spoffy changed the base branch from main to spoffy/external-attachments-prototype January 4, 2025 04:23
@@ -76,6 +76,9 @@ export interface IAttachmentStore {
// implementation and gives them control over local buffering.
download(docPoolId: DocPoolId, fileId: FileId, outputStream: stream.Writable): Promise<void>;

// Remove attachment from the store
delete(docPoolId: DocPoolId, fileId: FileId): Promise<void>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't actually used anywhere. I've left it in for now (since it's already implemented), but we could remove it?

@berhalak berhalak self-assigned this Jan 7, 2025
@Spoffy Spoffy force-pushed the spoffy/external-attachments-transfers branch from 9231ba5 to 70f707c Compare January 8, 2025 02:44
Base automatically changed from spoffy/external-attachments-prototype to main January 8, 2025 16:43
@Spoffy Spoffy force-pushed the spoffy/external-attachments-transfers branch from 70f707c to cbaf592 Compare January 8, 2025 16:45
Copy link
Collaborator

@fflorent fflorent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, that looks promising :).

Here are my remarks so far (I need to continue reading the code and start looking at your tests).

app/server/lib/AttachmentFileManager.ts Outdated Show resolved Hide resolved
app/server/lib/AttachmentFileManager.ts Outdated Show resolved Hide resolved
* @returns {Promise[Boolean]} True if the file got attached; false if this ident already exists.
*/
public findOrAttachFile(
fileIdent: string,
fileData: Buffer | undefined,
storageId?: string,
shouldUpdate: boolean = false,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am usually not very comfortable with boolean flag params, due to their lack of expressivity.

But am I right to say the shouldUpdate may be the default? I only see the flag is kept to false only in tests.

In any case, I would suggest for adding more expressivity the following, if that makes sense to you too:

  • make this function private;
  • introducing 2 new functions named updateOrAttachFile (where you call the private function with shouldUpdate = true) and attachFileIfNew (with shouldUpdate = false);
  • maybe (or maybe not, that's arguable if this is now private) change shouldUpdate: boolean into { shouldUpdate = false }: {shouldUpdate: boolean } = {} or similar;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting thoughts here! I thought a lot about how to split this - I don't really like the flag either, but ended up in the "tangled implementation" case of that article you linked! 🙂

I'd be happy to do it the way you suggest - make the flagged method private, and expose it as two different public methods. It absolutely feels easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split this up fully, and separated out the implementation in a way that I think it makes it easier to read. 🙂

Let me know if that works for you!

let isNewFile = true;

try {
// Try to insert a new record with the given ident. It'll fail UNIQUE constraint if exists.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment that used to be line 793 gave some reason that is still useful, I think. So I would add some more context:

Suggested change
// Try to insert a new record with the given ident. It'll fail UNIQUE constraint if exists.
// Try to insert a new record with the given ident. It'll fail UNIQUE constraint if exists.
// Even when attempting to attach a new file exclusively (and do nothing when it exists),
// it's a good idea to first check the existence of the fileIdent and if not insert the file and its data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is still relevant after the other changes made to this function? :)

await db.run('UPDATE _gristsys_Files SET data=?, storageId=? WHERE ident=?', fileData, storageId, fileIdent);
}

return isNewFile;
Copy link
Collaborator

@fflorent fflorent Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is acceptable as it is for me, so please skip if you don't agree much about my idea.

I feel like, even it introduces a supplementary SQL query to maintain, we would increase the readability by separating the query to check the existence of the file and do the INSERT/UPDATE. Am I right to say the below code would be equivalent?

const exists = await db.run('SELECT 1 from _gristsys_Files where ident = ?', fileIdent); // Not very sure what is returned here
if (exists && shouldUpdate) {
  await db.run('UPDATE _gristsys_Files SET data=?, storageId=? WHERE ident=?', fileData, storageId, fileIdent);
} else {
  await db.run('INSERT INTO _gristsys_Files(ident, data, storageId) VALUES (?, ?, ?)', fileIdent, fileData, storageId);
}
return exists;

Copy link
Contributor Author

@Spoffy Spoffy Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look equivalent to me. I'm not sure why it was originally done by catching the unique constraint, but it looks like the only place in the code where that's done.

I'll make this change, and we'll see if @paulfitz or anyone has any opinion :)

Edit: Change is pushed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! If there exist doubts, feel free to rollback

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We hit an error in one of our test cases due to this change, full writeup is below.

Long story short: The way we handle transactions means we can hit a race condition in this example, and catching the "UNIQUE" constraint violation is the simplest way of avoiding that.

I'll change this back, and add a comment explaining why we handle it this way.


Full explanation:

The SQLiteDB class' function execTransaction is built to merge all calls into a single transaction, rather than run a transaction per-call (which is how I believed it worked).

When several atttachments are being inserted simultaneously in one API call, and because Node's async behaviour is unpredictable, the individual SQL statements are effectively re-ordered at random due to multiple promises running in parallel. This means that even if the SELECT statement says "No records exist", they might be inserted before the "INSERT" actually happens, resulting in a UNIQUE constraint violation.

If we're adding two files at the same time, the resulting order of statements ended up going something like this:
SELECT 1 as fileExists FROM _gristsys_files WHERE ident = ?
SELECT 1 as fileExists FROM _gristsys_files WHERE ident = ?
INSERT INTO main._gristsys_Files (ident, data, storageId) VALUES (?, ?, ?)
INSERT INTO main._gristsys_Files (ident, data, storageId) VALUES (?, ?, ?) // Unique constraint error thrown here.

If they hypothetically had been separate transactions on separate DB connections, the second INSERT would have failed with a "database is locked" error instead - so that's actually no better here.

Hence, catching the UNIQUE constraint error is the cleanest way to avoid this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, thank you for your feedback! It makes sense

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new version is much simpler to read, thanks a lot! 🙏

app/server/lib/AttachmentFileManager.ts Outdated Show resolved Hide resolved
app/server/lib/AttachmentFileManager.ts Outdated Show resolved Hide resolved
*/
public getFileInfo(fileIdent: string): Promise<FileInfo | null> {
return this.get('SELECT ident, storageId, data FROM _gristsys_Files WHERE ident=?', fileIdent)
public getFileInfo(fileIdent: string, includeData: boolean = true): Promise<FileInfo | null> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think of the following to avoid the boolean flag param:

  public getFileInfo(fileIdent: string): Promise<FileInfo | null> {
    return this._getFileInfo(fileIdent, 'ident, storageId');
  }
  public getFileInfoWithData(fileIdent: string): Promise<FileInfo | null> {
    return this._getFileInfo(fileIdent, 'ident, storageId, data');
  }
  // ...
  private _getFileInfo(fileIdent: string, columns: string): Promise<FileInfo | null> {
    return this.get(`SELECT ${columns} FROM _gristsys_Files WHERE ident=?`, fileIdent)
      .... // just like what you did
   }

Copy link
Contributor Author

@Spoffy Spoffy Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think splitting them up is a good idea - it definitely makes it more readable.

I think I'd rather duplicate the original function, than pass columns as a string.

That's because there's no risk of bad data accidentally being loaded into the code that formats the result into a typed object (e.g, a missing column).

Additionally, because it lowers the risk of someone doing something silly (such as passing a user-provided column name) in the future.

Copy link
Contributor Author

@Spoffy Spoffy Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done, please let me know if the refactored version is clearer :)

return {
fileIdent,
isNewFile: !fileExists,
};
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a bit hard to read due to all of the conditions entangled. The good news is that it is private and you have a public method to call it, so you can rework it serenely.

I feel like it would be less complex to split this function into two different ones. I made the following, but have not tested:

  public async addFile(
    storeId: AttachmentStoreId | undefined,
    fileExtension: string,
    fileData: Buffer
  ): Promise<AddFileResult> {
    const fileIdent = await this._getFileIdentifier(fileExtension, Readable.from(fileData));
    return storeId ? 
      this._addFileToExternalStore(storeId, fileIdent, fileData) : 
      this._addFileToLocalStore(fileIdent, fileData);
  }


  // ...

  private async _addFileToExternalStore(
    destStoreId: AttachmentStoreId,
    fileIdent: string,
    fileData: Buffer
  ): Promise<AddFileResult> {
    const destStore = await this._getStore(destStoreId);
    if (!destStore) {
      this._log.warn({ fileIdent, storeId: destStoreId }, "tried to fetch attachment from an unavailable store");
      throw new StoreNotAvailableError(destStoreId);
    }
    const fileInfoNoData = await this._docStorage.getFileInfo(fileIdent, false);
    const fileExists = fileInfoNoData !== null;
    if (fileExists) {
      // File is already stored in a different store (e.g because store has changed and no migration has happened
      const isFileInTargetStore = destStoreId === fileInfoNoData.storageId;

      // Only exit early if the file is stored elsewhere of if the file exists in the store, 
      // otherwise we should allow users to fix any missing files
      // by proceeding to the normal upload logic.
      const fileAlreadyExistsInStore = isFileInTargetStore && await destStore.exists(this._getDocPoolId(), fileIdent);

      if (!isFileInTargetStore || fileAlreadyExistsInStore) {
        return {
          fileIdent,
          isNewFile: false,
        };
      }
    }
    // There's a possible race condition if anything changed the record between the initial checks
    // in this method, and the database being updated below - any changes will be overwritten.
    // However, the database will always end up referencing a valid file, and the pool-based file
    // deletion guarantees any files in external storage will be cleaned up eventually.
    await this._storeFileInAttachmentStore(destStore, fileIdent, fileData);

    return {
      fileIdent,
      isNewFile: !fileExists,
    };
  }

  private async _addFileToLocalStore(
    fileIdent: string,
    fileData: Buffer
  ): Promise<AddFileResult> {
    const fileInfoNoData = await this._docStorage.getFileInfo(fileIdent, false);
    const fileExists = fileInfoNoData !== null;
    if (!fileExists) {
      await this._storeFileInLocalStorage(fileIdent, fileData);
    }
    return {
      fileIdent,
      isNewFile: !fileExists,
    };
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've split this into two, and there's a few duplicate comments now - but I think that's reasonable, as it's definitely a lot easier to understand the logic now.

}
}

// There's a possible race condition if anything changed the record between the initial checks
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo:

Suggested change
// There's a possible race condition if anything changed the record between the initial checks
// There's a possible race condition in anything changed the record between the initial checks

Copy link
Contributor Author

@Spoffy Spoffy Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a typo here, but I'll clean up the comment to be more intelligible!

It's trying to say:

There's the potential for a race condition here, if any other code modifies the database record between the the checks at the start of this function, and this line of code. The checks could pass then, but fail if evaluated here (due to record changes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrased to this:

// A race condition can occur here, if the file's database record is modified between the
// `getFileInfoNoData` call earlier in this function and now.
// Any changes made after that point will be overwritten below.
// However, the database will always end up referencing a valid file, and the pool-based file
// deletion guarantees any files in external storage will be cleaned up eventually.

@Spoffy Spoffy force-pushed the spoffy/external-attachments-transfers branch 4 times, most recently from 227350c to 586d5a4 Compare January 14, 2025 05:52
@fflorent
Copy link
Collaborator

We have successfully run your PR with a heavy document! 🎉

Here are feedback:

  • ⚠️ We faced an error due to this little mistake: Adds transfers between stores to external attachments #1358 (comment) :)
  • ✅ We successfully run the POST /api/docs/:docId/attachments/transferAll endpoint
  • ⚠️ We could not call the GET /api/docs/:docId/attachments/transferStatus endpoint, we don't know yet the cause
  • ❓ After the transfer, we had to request a vacuum on the doc to effectively shrink it, we wonder if there are good reason for not doing that immediately at the end of the operation (the vacuum is expected to be relatively fast if we have saved much space).
  • ℹ️ After the vacuum, the doc shrank from 860Mb to 24Mb

And we're very glad of the progress, thank you @Spoffy! 🙏

Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! @Spoffy what do you think about changing the Promise.all in ActiveDoc around here:

      const userActions: UserAction[] = await Promise.all(
        upload.files.map(file => this._prepAttachment(docSession, file)));

... and just doing that work sequentially? I think it would be hard to catch race conditions in the code, and there doesn't seem much reason to try to squeeze out some extra speed right here.

app/server/lib/ActiveDoc.ts Show resolved Hide resolved
interface AttachmentFileManagerLogInfo {
fileIdent?: string;
storeId?: string | null;
}

interface AttachmentFileInfo {
ident: string,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of codebase uses semicolon style for interfaces (I think?) unless there's a reason not to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also sometimes small non-exported interfaces or classes are put down below the "meat" of the file, so that they don't distract, and that might be nice here (but this is a very weak suggestion, do what you like).

* - Avoid data loss at all costs (missing files in stores, or missing file table entries)
* - Always be in a valid state if possible (e.g no file entries with missing attachments)
* - Files in stores with no file record pointing to them is acceptable (but not preferable), as
* they'll eventually be cleaned up when the document pool is deleted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this.

@@ -99,6 +132,9 @@ export class AttachmentFileManager implements IAttachmentFileManager {
this._docPoolId = _docInfo ? getDocPoolIdFromDocInfo(_docInfo) : null;
}

// This attempts to add the attachment to the given store.
// If the file already exists in another store, it doesn't take any action.
// Therefore, there isn't a guarantee that the file exists in the given store, even if this method doesn't error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for adding this.

if (hasExternal) {
return DocAttachmentsLocationSummary.EXTERNAL;
}
return DocAttachmentsLocationSummary.INTERNAL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the zero-attachment case too right? I wonder if the zero-attachment case should be external if a default external store is set. Edge case, don't know if there is scope for confusion, feels unimportant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way we'd need to do this is to add a new state "NO_FILES", return that and either:

  • Render it how we like in the frontend
  • Or translate it in ActiveDoc to either "EXTERNAL" or "INTERNAL"

I don't mind doing that (tiny bit of work) - what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of the summary is to tell if we need transfer or not. So this needs to be sorted out.

} catch(e) {
this._log.warn({ fileIdent, storeId: targetStoreId }, `transfer failed: ${e.message}`);
}
finally {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put the finally on same line, as } finally {, like the catch.

app/server/lib/AttachmentFileManager.ts Outdated Show resolved Hide resolved
const fileMetadata = await this._docStorage.getFileInfo(fileIdent, false);
// This check runs before the file is retrieved as an optimisation to avoid loading files into
// memory unnecessarily.
if (!fileMetadata || fileMetadata.storageId == newStoreId) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We avoid == in this codebase because anytime it is used reviewers have to think too much.

throw new StoreNotAvailableError(storeId);

// A race condition can occur here, if the file's database record is modified between the
// `getFileInfoNoData` call earlier in this function and now.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ActiveDoc has a well worked out mechanism to avoid parallel changes to the document (see the Share class). The underlying SQLite database also applies changes sequentially. It feels like something has gone wrong somewhere, there is some parallelism introduced for attachments, that is making your life harder.

Copy link
Contributor Author

@Spoffy Spoffy Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is theoretical for the most part - I think it might be possible if several add requests for the file come in at once, and we don't apply the share class to the attachment API in ActiveDoc? Hard to be sure though without studying the code very carefully to trace all paths to this function.

I could probably remove this without any harm, but I thought it might be worth pointing out just in case in comes at some point in the future. Could also add "theoretical" to the start of the comment.

Not sure what the best practice here is, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm the user actions applied in ActiveDoc.addAttachments should happen via the Share class ultimately. I feel that the Promise.all in there is silly. But it does up the odds of tickling race conditions like this that might be hard to spot otherwise, so that's good.

Ok, assuming this was the only race, fine to leave as is.

this._app.get('/api/docs/:docId/attachments/transferStatus', isOwner, withDoc(async (activeDoc, req, res) => {
res.json({
status: activeDoc.attachmentTransferStatus(),
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about sticking in locationSummary for consistency with /transferAll?

@Spoffy Spoffy force-pushed the spoffy/external-attachments-transfers branch from a56cfa0 to 337c128 Compare January 15, 2025 23:38
}));

// Returns the status of any current / pending attachment transfers
this._app.get('/api/docs/:docId/attachments/transferStatus', isOwner, withDoc(async (activeDoc, req, res) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at the endpoint registered above (line 521), it will be matched first and this handler won't be executed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nicely spotted, I made the mistake of assuming it was done by specificity. Thanks 🙂

interface AttachmentFileManagerLogInfo {
fileIdent?: string;
storeId?: string | null;
export enum DocAttachmentsLocationSummary {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type should be in common (UserAPI), it is duplicated there.

Also there we have this type:

export interface AttachmentLocationSummary {
  summary: 'INTERNAL'|'EXTERNAL'|'MIXED';
}

The type above looks artificial. In UI and probably elsewhere, the words internal,external,mixed describe the storage of attachments itself. So here I think we don't need to add this summary at the end. Either name it Location, Storage, or add (to the AttachmentLocationSummary) some other bits (like number of files per storage) to make it true summary.

Also in UserAPI there is:

export interface AttachmentTransferStatus {
  status: {
    pendingTransferCount: number;
    isRunning: boolean;
  };
}

This additional property (status) that wraps the content is absolutely not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll clean up "DocAttachmentsLocationSummary" - rename it to DocAttachmentsLocation and move it to UserAPI.ts.

I'm not sure where you're seeing export interface AttachmentTransferStatus, that's not showing up in any my searches. Is it something you might have added in the UI code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done now!

interface AttachmentFileManagerLogInfo {
fileIdent?: string;
storeId?: string | null;
export enum DocAttachmentsLocationSummary {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at StringUnion.ts in common. I think we prefere it over enums.

}));

// Returns the status of any current / pending attachment transfers
this._app.get('/api/docs/:docId/attachments/transferStatus', isOwner, withDoc(async (activeDoc, req, res) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have something called docChatter (look at CommTypes.ts). This is a way to inform the clients about some backend status in more deterministic ways (so that it works across multiple tabs/clients etc). Before UI lands this AttachmentFileManager needs to leverage it, and inform about the background job status through this channel.

If you want, I can take this task and do it in my PR for the UI.

But this api endpoint still needs to stay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be fine to keep the live status part to your PR @berhalak so it lands with the UI. The grist.gouv team would like the functionality in this PR in for the next Grist release if possible.

* - Always be in a valid state if possible (e.g no file entries with missing attachments)
* - Files in stores with no file record pointing to them is acceptable (but not preferable), as
* they'll eventually be cleaned up when the document pool is deleted.
*
*/
export class AttachmentFileManager implements IAttachmentFileManager {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interface is never used anywhere, and it conveys a false impression that AttachmentFileManager has >=2 implementations or is injected somehow. Lets remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

if (hasExternal) {
return DocAttachmentsLocationSummary.EXTERNAL;
}
return DocAttachmentsLocationSummary.INTERNAL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of the summary is to tell if we need transfer or not. So this needs to be sorted out.

Copy link
Collaborator

@fflorent fflorent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look, everything seems in good shape. I still need to review the tests, I'll come ASAP for that (unless other people review them in the meantime, I don't want my review to be blocking)

@@ -218,3 +463,22 @@ export class AttachmentFileManager implements IAttachmentFileManager {
};
}
}

async function validateFileChecksum(fileIdent: string, fileData: Buffer): Promise<boolean> {
return fileIdent.startsWith(await checksumFileStream(Readable.from(fileData)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking: I expect this to work (just to have a strict check and not to rely on startsWith):

return path.parse(fileIdent).name === await checksumFileStream(Readable.from(fileData));

But feel free to skip this comment, I don't expect much inconvenience with this implementation :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this won't work! fileIdents have the file extension (e.g .grist or .txt or .png) appended to them by existing code - they aren't purely the hash.

It's not something we can easily change, as it's like that in all existing grist docs - so this is what we've got to do for now!

await db.run('UPDATE _gristsys_Files SET data=?, storageId=? WHERE ident=?', fileData, storageId, fileIdent);
}

return isNewFile;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new version is much simpler to read, thanks a lot! 🙏

@Spoffy Spoffy marked this pull request as ready for review January 22, 2025 01:50
@Spoffy
Copy link
Contributor Author

Spoffy commented Jan 22, 2025

@fflorent - Could you re-test and ensure everything works as expected now, including the vacuuming? :)

Otherwise this is ready to go, pretty much. There's two things that would be nice to do:

  1. Tests on ActiveDoc's methods
  2. @ActiveDoc.keepOpen to keep the document alive while transfers are happening.

Waiting for a response from Paul tomorrow for 2. 1 is optional, as I could roll it into the config PR instead when it would be a little easier.

@fflorent
Copy link
Collaborator

fflorent commented Jan 22, 2025 via email

@Spoffy Spoffy force-pushed the spoffy/external-attachments-transfers branch from 42a7a6f to 6e39ba0 Compare January 24, 2025 00:42
@fflorent
Copy link
Collaborator

Could you re-test and ensure everything works as expected now, including the vacuuming? :)

I'll do that when I can in the beginning of the next week, I am in a hurry these days, sorry for that!

@Spoffy Spoffy force-pushed the spoffy/external-attachments-transfers branch from c0e2231 to 47c2078 Compare January 28, 2025 14:08
@Spoffy Spoffy force-pushed the spoffy/external-attachments-transfers branch from b6dad1e to d85ea0e Compare January 28, 2025 18:41
Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(few small things noticed in passing)

@@ -481,6 +482,11 @@ interface SqlResult extends TableRecordValuesWithoutIds {
statement: string;
}

export const DocAttachmentsLocation = StringUnion(
"NO FILES", "INTERNAL", "MIXED", "EXTERNAL"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at other string unions in this codebase I think they lean camelCase. Also maybe just none instead of noFiles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right - I'll swap the casing around here :)

}
await this.applyUserActions(docSessions, [
// Use docInfo.id to avoid hard-coding a reference to a specific row id, in case it changes.
["UpdateRecord", "_grist_DocInfo", docInfo.id, { documentSettings: JSON.stringify(settings) }]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see a case of it changing?? (this is just idle curiosity, code is fine)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, I don't - just wanted to avoid the magic constant of -1, which felt a bit like leaving a timebomb lying around.

}
const transferPromise = this._performPendingTransfers();
const newTransferJob: TransferJob = {
promise: transferPromise,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the nesting? Could newTransferJob not be assigned transferPromise directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was anticipating adding more info to this later - but I can simplify for now and deal with that if / when we get to it.

data: Buffer;
}

interface TransferJob {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the value of this interface compared to just a Promise? Could it just be type TransferJob = Promise<void>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same answer as above :)

@paulfitz
Copy link
Member

Tested the functionality, and it looks good, with the main thing missing being a way to run it with something like GRIST_EXTERNAL_ATTACHMENTS_MODE=snapshots to get back the ability to run against an S3-compatible store configured for snapshots. More nuance can come later, but just having test doesn't meet the bar of becoming usable.

For test, I may have missed how to locate where the files will be stored (it isn't super important once a mode like snapshots is available though).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants